Analysis

HDI Aspects:

HDI consequences and implications:

Course of action/recommendation - More educators, jobs and medical specialists to improve HDI

Human Development Policy

Extra Sources on HDI

Importing libraries

library(dplyr)
library(readxl)
library(tidygeocoder)
library(sf)
library(mapview)
library(RColorBrewer)
library(plotly)

Importing data

data <- read_excel("geo_NCdata.xlsx")

Human Development Index data

hdi_data <- select(data, c("City", "Education", "Income", "Occupation",
                           "Health Status", "Housing",
                           "latitude", "longitude"))
head(hdi_data)
## # A tibble: 6 x 8
##   City    Education Income Occupation `Health Status` Housing latitude longitude
##   <chr>       <dbl>  <dbl>      <dbl>           <dbl>   <dbl>    <dbl>     <dbl>
## 1 Aggene~      33.1   3.30       60.1            94.0    99.9    -29.2      18.8
## 2 Alexan~      30.4   8.91       44.4            93.9   100.     -28.6      16.5
## 3 Askham~      13.2  23.3        27.0            93.8   100.     -27.0      20.8
## 4 Augrab~      16.5  25.5        29.3            93.1    97.8    -28.5      20.1
## 5 Barkly~      28.4  27.1        38.6            91.0    84.2    -28.5      24.5
## 6 Brandv~      13.4  14.9        43.4            95.4    99.9    -30.5      20.5

Distribution Analysis

fig <- hdi_data %>%
  plot_ly(
    y = ~Education,
    type = 'violin',
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    x0 = 'Education')
fig <- fig %>%
  layout(
    title = "Distribution of Education",
    yaxis = list(title = "%", zeroline = FALSE))

fig

Cities/Towns that are not geocoded

hdi_data[rowSums(is.na(hdi_data)) > 0,]$City
## [1] "Delpoortshoop, Northern Cape"    "Olynvenhoutsdrif, Northern Cape"
## [3] "Phillipstown, Northern Cape"     "Soverby, Northern Cape"

Removing Cities that are not geocoded

locations_hdi <- subset(hdi_data, !is.na(hdi_data$longitude) & !is.na(hdi_data$latitude))

K-means Cluster Analysis

Clustering the data

Standardizing data

  • Standardizing (scaling) data to remove variations due to different measurement scales
locations_hdi_scale <- scale(select(locations_hdi,
                                    c("Education", "Income", "Occupation",
                                      "Health Status", "Housing")))

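As a minimal sketch of what the scaling step does, the example below applies scale() to a small made-up data frame and reproduces one column by hand. The values and the names toy, toy_scaled and manual_education are illustrative assumptions, not taken from geo_NCdata.xlsx.

```r
# Toy data (hypothetical values, not from geo_NCdata.xlsx)
toy <- data.frame(Education = c(10, 20, 30, 40),
                  Income    = c(1, 3, 5, 7))

# scale() centres each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# The same transformation written out by hand for one column
manual_education <- (toy$Education - mean(toy$Education)) / sd(toy$Education)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)
```

Because every indicator ends up on the same scale, no single variable dominates the Euclidean distances that k-means relies on.
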
Assessing Clustering Tendency (ACT)

  • ACT evaluates whether the data set contains meaningful clusters or not (feasibility of the cluster analysis)
  • Method: Statistical (Hopkins statistic)
  • The Hopkins statistic assesses the clustering tendency of a data set by measuring the probability that it was generated by a uniform distribution; in other words, it tests the spatial randomness of the data.
  • A Hopkins statistic (H) value of about 0.5 means that the data are uniformly distributed
  • Null hypothesis: the data set D is uniformly distributed (i.e., no meaningful clusters)
  • Alternative hypothesis: the data set D is not uniformly distributed (i.e., contains meaningful clusters)
  • If H is close to zero, we can reject the null hypothesis and conclude that the data set D is significantly clusterable. This reading assumes the convention in which clustered data give small values of H; some implementations report the complement (1 - H), in which case values close to 1 indicate clusterability.
#hopkins(locations_hdi_scale, n = nrow(locations_hdi_scale)-1)
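
To make the idea behind the statistic concrete, here is a rough base-R sketch under the convention used in the bullets above (clustered data give values near 0). The function hopkins_sketch and the toy matrices are hypothetical illustrations; this is not the implementation behind the commented-out hopkins() call, and packaged versions may differ in details such as distance exponents.

```r
# Hypothetical helper: a simplified Hopkins statistic, H = sum(w) / (sum(u) + sum(w))
#   w: distance from a sampled real point to its nearest other real point
#   u: distance from a uniformly generated point to its nearest real point
hopkins_sketch <- function(X, n) {
  X <- as.matrix(X)
  stopifnot(n < nrow(X))
  d <- ncol(X)
  # n real observations sampled without replacement
  idx <- sample(nrow(X), n)
  # n artificial points drawn uniformly within the range of each column
  U <- sapply(seq_len(d), function(j) runif(n, min(X[, j]), max(X[, j])))
  U <- matrix(U, nrow = n)
  # Euclidean distance from one point p to every row of M
  dist_to_rows <- function(p, M) {
    sqrt(rowSums((M - matrix(p, nrow(M), ncol(M), byrow = TRUE))^2))
  }
  w <- sapply(idx, function(i) min(dist_to_rows(X[i, ], X[-i, , drop = FALSE])))
  u <- sapply(seq_len(n), function(i) min(dist_to_rows(U[i, ], X)))
  sum(w) / (sum(u) + sum(w))
}

set.seed(1)
# Two tight, well-separated groups versus points spread uniformly
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2))
uniform   <- matrix(runif(200), ncol = 2)

h_clustered <- hopkins_sketch(clustered, n = 20)
h_uniform   <- hopkins_sketch(uniform, n = 20)
```

In this sketch h_clustered should come out far below 0.5, while h_uniform should land in the neighbourhood of 0.5, which matches the interpretation given above.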

Estimating the optimal number of clusters

  • Methods: Elbow method (within sum of square) and Silhouette method
  • library: factoextra
library(factoextra)
fviz_nbclust(locations_hdi_scale, kmeans, method = "wss")

fviz_nbclust(locations_hdi_scale, kmeans, method =  "silhouette")
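
The elbow method can also be sketched without factoextra: fit k-means for a range of k and record the total within-cluster sum of squares. The toy data below are simulated with three obvious groups purely for illustration, and the names toy, k_values and wss are assumptions of this sketch; only the WSS criterion is shown, not the silhouette method.

```r
set.seed(42)
# Toy data with three well-separated groups (hypothetical, for illustration only)
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 8, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For these simulated data the curve drops steeply up to k = 3 and flattens afterwards, so the elbow sits at the number of groups that were generated.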

K-means Clustering

  • Two clusters are ideal for grouping the observations, as suggested by the estimation methods above
set.seed(123)
locations_hdi_cluster <- kmeans(locations_hdi_scale, 
                               centers = 2, nstart = 25)
library(ggplot2)
library(plotly)

Cluster Visual Assessment

  • Observations are represented by points in the plot, using principal components if ncol(data) > 2.
  • PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.
ggplotly(fviz_cluster(locations_hdi_cluster, data = locations_hdi_scale) +
           theme_minimal() +
           theme(legend.position = "none") +
           ggtitle("Human Development Index Clusters (Groups)"))
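
The projection onto principal components described above can be illustrated with prcomp() from base R. This is a sketch on simulated data; the variable names and values are hypothetical, and fviz_cluster is only assumed to perform an equivalent projection internally.

```r
set.seed(7)
# Toy data: five indicators for 50 hypothetical towns, three of them correlated
base_signal <- rnorm(50)
toy <- cbind(a = base_signal + rnorm(50, sd = 0.2),
             b = base_signal + rnorm(50, sd = 0.2),
             c = -base_signal + rnorm(50, sd = 0.2),
             d = rnorm(50),
             e = rnorm(50))

# PCA on centred and standardized variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Coordinates of every observation on the first two principal components;
# these are the kind of values a 2-D cluster plot puts on its axes
scores_2d <- pca$x[, 1:2]

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The first entries of var_explained indicate how much of the variation in the five indicators survives in the two-dimensional plot.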

Adding the clusters to the Human Development Data Frame

locations_hdi$Cluster <- as.factor(locations_hdi_cluster$cluster)
head(locations_hdi)
## # A tibble: 6 x 9
##   City    Education Income Occupation `Health Status` Housing latitude longitude
##   <chr>       <dbl>  <dbl>      <dbl>           <dbl>   <dbl>    <dbl>     <dbl>
## 1 Aggene~      33.1   3.30       60.1            94.0    99.9    -29.2      18.8
## 2 Alexan~      30.4   8.91       44.4            93.9   100.     -28.6      16.5
## 3 Askham~      13.2  23.3        27.0            93.8   100.     -27.0      20.8
## 4 Augrab~      16.5  25.5        29.3            93.1    97.8    -28.5      20.1
## 5 Barkly~      28.4  27.1        38.6            91.0    84.2    -28.5      24.5
## 6 Brandv~      13.4  14.9        43.4            95.4    99.9    -30.5      20.5
## # ... with 1 more variable: Cluster <fct>

Cluster Mean

  • Creating a Human Development Index data frame
hdi_clust <- select(locations_hdi, c("Education", "Income", "Occupation", 
                       "Health Status", "Housing"))
  • Computing the cluster means of the different Human Development indicators
  • This shows how the indicators vary by group
  • The cluster centers help in evaluating how distinct the clusters are, which in turn suggests whether the cluster analysis was executed properly
hdi_clust_table <- aggregate(hdi_clust,
                             by = list(cluster = locations_hdi_cluster$cluster),
                             FUN = mean)
hdi_clust_table
##   cluster Education   Income Occupation Health Status  Housing
## 1       1  22.06866 20.69253   34.81301      94.06322 97.42341
## 2       2  32.32615 27.58929   42.14865      91.40413 75.68210
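
One way to sanity-check a table like this is to compare it with the centers stored by kmeans(), which are the same means expressed in standardized units. The sketch below does this on simulated data; toy, km and centers_back are hypothetical names, and the check relies on the scaled:center and scaled:scale attributes that scale() attaches to its result.

```r
set.seed(123)
# Toy data with two groups (hypothetical values)
toy <- data.frame(x = c(rnorm(20, 10, 1), rnorm(20, 30, 1)),
                  y = c(rnorm(20, 5, 1),  rnorm(20, 50, 1)))
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Cluster means on the original scale, computed as in the aggregate() call above
means_original <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# km$centers holds the cluster means in standardized units;
# undo the scaling: multiply by the column sd, then add the column mean
centers_back <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_back <- sweep(centers_back, 2, attr(toy_scaled, "scaled:center"), "+")
```

If the two tables agree, the cluster labels and the aggregated means refer to the same grouping.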

Human Development Index Clusters

# locations_nr %>%
#       group_by(Cluster) %>%
#       summarise(n = n()) %>%
#       arrange(n) %>%
#       mutate(Cluster = factor(Cluster, levels = unique(Cluster))) %>%
#       plot_ly(x = ~n, y = ~Cluster, type = "bar") %>%
#       layout(title = "Natural Resource Grouping", yaxis = list(title = "Cluster"),
#              xaxis = list(title = "Number of Cities/Towns"))

ggplotly(locations_hdi %>%
      group_by(Cluster) %>%
      summarise(No_of_Cities = n()) %>%
      arrange(No_of_Cities) %>%
      mutate(Cluster = factor(Cluster, levels = unique(Cluster))) %>%
      ggplot(aes(x = Cluster, y = No_of_Cities)) +
      geom_bar(stat = "identity",
               fill = "#1f77b4") +
      geom_text(aes(label = No_of_Cities),
                vjust = -0.25) +
      coord_flip() +
      labs(x = "Cluster", 
           y = "Number of Cities/Towns",
           title = "Human Development Grouping (Clusters)") +
      theme_minimal())

Viewing the clusters with mapview

  • Creating an sf object from the cluster data for the Human Development Index
Human_Development <- st_as_sf(locations_hdi, coords = c("longitude", "latitude"), crs = 4326)
  • Kimberley's outlying Education level is plausible, since Kimberley is the capital of the Northern Cape province
mapview(Human_Development, 
        zcol = "Cluster")